Online Retail Clean merupakan sebuah perusahaan jasa retail data set dari data transaksi berisi data customer ID, Frequency dan Monetary
Cutomer ID : adalah ID unik yang dimiliki oleh masing-masing pelanggan Recency : adalah jumlah hari dari hari terakhir customer membeli ( satuan hari)
Frequency : adalah jumlah pembelian yang dilakukan oleh customer ( satuan kali)
Monetary : adalah total nilai pembelian dari customer ( satuan Dollar)
library(dplyr)
library(plotly)
library(factoextra)
library(cluster)
library(clValid)
x<-read.csv('https://raw.githubusercontent.com/arikunco/machinelearning/master/dataset/online_retail_clean.csv')
print(summary(x))
CustomerID recency frequency monetary
Min. :12347 Min. : 21.47 Min. : 1.000 Min. :-38970.00
1st Qu.:13752 1st Qu.: 47.43 1st Qu.: 1.000 1st Qu.: 13.20
Median :15249 Median : 92.39 Median : 1.000 Median : 26.10
Mean :15270 Mean :133.05 Mean : 2.259 Mean : 52.63
3rd Qu.:16792 3rd Qu.:203.47 3rd Qu.: 2.000 3rd Qu.: 56.73
Max. :18283 Max. :394.51 Max. :86.000 Max. : 10122.56
Gunakan R sudio Desktop dengan OS Windows atau gunakan R Server dengan Google Chrome Browser
p <- plot_ly(x, x = ~recency, y = ~frequency, z = ~monetary) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary')))
p
Dari Summary data dan dari plot data terihat bahwa terdapat data monetary dengan nilai <0
secara logika tidak mungkin jumlah pembelian <0 maka data ini di anggap data invalid yang harus dibuang
selain data NA juga harus dibuang dari data set
x<-filter(x, monetary >= 0) %>% na.omit(x)
print(summary(x))
CustomerID recency frequency monetary
Min. :12347 Min. : 21.47 Min. : 1.000 Min. : 0.00
1st Qu.:13752 1st Qu.: 47.35 1st Qu.: 1.000 1st Qu.: 13.80
Median :15252 Median : 92.22 Median : 1.000 Median : 27.09
Mean :15270 Mean :132.11 Mean : 2.267 Mean : 71.20
3rd Qu.:16790 3rd Qu.:200.60 3rd Qu.: 2.000 3rd Qu.: 57.50
Max. :18283 Max. :394.51 Max. :86.000 Max. :10122.56
Pada metode cluster sangat dipengaruhi jarak antar point pada setiap variable
dikarenakan setiap variable memiliki satuan dan skala yang berbeda, maka perlu dilakukan sclaling dari setiap nilai
pada variable agar memiliki skala yang sama
scale dilakukan dengan menghitung z score dari masing2 nilai variable
x.scale <- as.data.frame(scale(x[,2:4], scale = TRUE))
head(x.scale)
summary(x.scale)
recency frequency monetary
Min. :-1.0627 Min. :-0.36498 Min. :-0.21248
1st Qu.:-0.8141 1st Qu.:-0.36498 1st Qu.:-0.17128
Median :-0.3831 Median :-0.36498 Median :-0.13162
Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.6579 3rd Qu.:-0.07697 3rd Qu.:-0.04089
Max. : 2.5204 Max. :24.11595 Max. :29.99504
x$recency_z <- x.scale$recency
x$frequency_z <- x.scale$frequency
x$monetary_z <- x.scale$monetary
p <- plot_ly(x, x = ~recency_z, y = ~frequency_z, z = ~monetary_z) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary')))
p
Elbow method
fviz_nbclust(x[,5:7], kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
Silhouette method
fviz_nbclust(x[,5:7], kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
Kedua metode menghasilkan jumlah K optimal k=4
km.out <- kmeans(x[,5:7], center=4, nstart=10)
x$cluster <- km.out$cluster
plot(x[,5:7], col = x$cluster,
main = "K-MEANS of RFM")
p <- plot_ly(x, x = ~recency_z, y = ~frequency_z, z = ~monetary_z, color = ~cluster) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary')))
p
fviz_nbclust(x[,5:7], pam, method = "silhouette")+
theme_classic()
jumlah K Optimal adalah k=2
pam.out<-pam(x[,5:7], 2, metric = "euclidean", stand = FALSE)
x$clusterpam <-pam.out$clustering
head(x)
plot(x[,5:7], col = x$clusterpam,
main = "K-MEDOIDS of RFM")
p <- plot_ly(x, x = ~recency_z, y = ~frequency_z, z = ~monetary_z, color = ~clusterpam) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary')))
p
Compute Dissimilarity Matrix
res.dist <- dist(x[,5:7], method = "euclidean")
res.hc <- hclust(d = res.dist, method = "complete")
plot(res.hc)
clmethods <- c("hierarchical","kmeans","pam")
intr <- clValid(x[,5:7], nClust = 2:6, clMethods = clmethods,validation = "internal", maxitems = 2350 ,metric = "euclidean",method = "complete")
summary(intr)
Clustering Methods:
hierarchical kmeans pam
Cluster sizes:
2 3 4 5 6
Validation Measures:
2 3 4 5 6
hierarchical Connectivity 3.8579 10.4567 13.3151 22.8099 25.0266
Dunn 0.5741 0.2713 0.2808 0.2056 0.2504
Silhouette 0.9478 0.9077 0.8763 0.8311 0.8039
kmeans Connectivity 14.4048 12.6524 65.1853 25.7480 102.0024
Dunn 0.0611 0.1926 0.0005 0.0891 0.0017
Silhouette 0.9111 0.9066 0.5582 0.7778 0.5400
pam Connectivity 48.7032 79.6806 99.9222 156.5710 161.3635
Dunn 0.0003 0.0005 0.0003 0.0003 0.0002
Silhouette 0.5291 0.3844 0.4448 0.3302 0.3058
Optimal Scores:
optimalScores(intr)
Diketahui algoritma optimal adalah Hirarki Cluster dengan jumlah cluster = 2
hc.out <- cutree(res.hc, k=2)
x$clusterhc <- hc.out
head(x)
plot(x[,5:7], col = x$clusterhc,
main = "HC of RFM")
p <- plot_ly(x, x = ~recency_z, y = ~frequency_z, z = ~monetary_z, color = ~clusterhc) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency'),
yaxis = list(title = 'Frequency'),
zaxis = list(title = 'Monetary')))
p
NA
Dari plot terlihat pembagian 2 kluster secara bisnis berarti apa-apa. perlu di evaluasi Ulang model dikarenakan tidak sesuai dengan keperluan bisnis tahapan kerja kembali ke Business Understanding dan data preparation
boxplot(x[,5:7])
Data frequency dan monetary terlalu banyak outlier sehingga model kluster yang dihasilkan tidak memiliki arti
boxplot(x$frequency_z)
freq.outlier <- boxplot(x$frequency_z)$out
x.clean1<-x[-which(x$frequency_z %in% freq.outlier),]
boxplot(x.clean1[,5:7])
boxplot(x$monetary_z)
money.outlier <- boxplot(x.clean1$monetary_z)$out
x.clean2<-x.clean1[-which(x.clean1$monetary_z %in% money.outlier),]
boxplot(x.clean2[,5:7])
clmethods <- c("hierarchical","kmeans","pam")
intr <- clValid(x.clean2[,c(6,5,7)], nClust = 2:4, clMethods = clmethods,validation = "internal", maxitems = 2350 ,metric = "euclidean",method = "complete")
summary(intr)
Clustering Methods:
hierarchical kmeans pam
Cluster sizes:
2 3 4
Validation Measures:
2 3 4
hierarchical Connectivity 13.0385 37.0075 47.8889
Dunn 0.0175 0.0120 0.0148
Silhouette 0.6212 0.5208 0.5089
kmeans Connectivity 25.6198 41.7258 74.9028
Dunn 0.0050 0.0140 0.0074
Silhouette 0.6446 0.5686 0.4747
pam Connectivity 23.9464 37.1016 83.5175
Dunn 0.0046 0.0138 0.0010
Silhouette 0.6426 0.5679 0.4437
Optimal Scores:
optimalScores(intr)
Diketahui algoritma optimal dengan dunn index terbaik adalah Hirarki Cluster dengan jumlah cluster = 2
res.dist <- dist(x.clean2[,c(6,5,7)], method = "euclidean")
res.hc <- hclust(d = res.dist, method = "complete")
plot(res.hc)
hc.out <- cutree(res.hc, k=2)
x.clean2$clusterhc <- hc.out
head(x.clean2)
plot(x.clean2[,5:7], col = x.clean2$clusterhc,
main = "HC of RFM with z scale")
plot(x.clean2[,2:4], col = x.clean2$clusterhc,
main = "HC of RFM")
p <- plot_ly(x.clean2, x = ~recency, y = ~frequency, z = ~monetary, color = ~clusterhc) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Recency (days)'),
yaxis = list(title = 'Frequency(times)'),
zaxis = list(title = 'Monetary(dollar)')))
p
Dari data transaksi pelanggan diketahui terdapat 2 kelompok pelanggan
2 kelompok pelanggan lebih di bagi berdasarkan kekinian dari mereka melakukan transaksi
nilai Monetary terdistribsi rapat sehingga dianggap satu kelompok. sedangkan nilai frequency distribusinya terlalu kecil hanya terdiri 3 nilai sehingga dianggap satu kelompok. Pembagian kelompok lebih kepada mempertimbangkan nilai recency Kelompok 1 adalah pembeli yang membeli kurang dari 250 hari
KeLompok 2 adalah pembeli yang membeli lebih dari 250 hari